Dataset: Medical Information Mart for Intensive Care-III (MIMIC-III)
On the development and validation of large language model-based classifiers for identifying social determinants of health
NLP Tasks: Text Classification, Information Extraction
Method: LLM-based classifiers built on Bidirectional Encoder Representations from Transformers (BERT) and a Robustly Optimized BERT Pretraining Approach (RoBERTa); see the sketch after the metrics
Metrics:
- Area under the receiver operating characteristic curve (AUROC) for homelessness (0.78)
- AUROC for food insecurity (0.72)
- AUROC for domestic violence (0.83)
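A minimal sketch of this setup, assuming one binary RoBERTa classifier per SDoH factor and AUROC evaluation; "roberta-base" is an untuned stand-in for the paper's fine-tuned checkpoint, and the example notes and labels are invented.

```python
# Sketch: one binary classifier per SDoH factor, scored by AUROC.
# "roberta-base" stands in for the paper's fine-tuned model, so its
# classification head is randomly initialized; outputs are illustrative.
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # factor present / absent
model.eval()

def score(note: str) -> float:
    """Probability that the SDoH factor (e.g., homelessness) is documented."""
    inputs = tokenizer(note, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        return torch.softmax(model(**inputs).logits, dim=-1)[0, 1].item()

notes = ["Currently undomiciled, staying in shelters.",
         "Lives at home with spouse, no housing concerns."]
labels = [1, 0]  # invented gold labels
print("AUROC:", roc_auc_score(labels, [score(n) for n in notes]))
```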
Extraction of Substance Use Information From Clinical Notes: Generative Pretrained Transformer-Based Investigation
NLP Tasks: Text Classification, Information Extraction, Question Answering, Text Generation
Method: a generative pretrained transformer (GPT) model, specifically GPT-3.5 (see the sketch after the metrics)
Metrics:
- Accuracy (high in the zero-shot setting)
- Recall (improved in the few-shot setting)
- F1-score (improved in the few-shot setting)
- Precision (lower in the few-shot setting)
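A minimal sketch of the zero- vs. few-shot comparison using the OpenAI chat completions API; the prompt wording, the few-shot demonstration, and the "gpt-3.5-turbo" model alias are assumptions, not the paper's exact setup.

```python
# Sketch: zero- vs. few-shot substance-use extraction via the OpenAI
# chat completions API. Prompts and the demonstration are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

FEW_SHOT = [  # one invented demonstration; the paper's exemplars differ
    {"role": "user", "content": "Note: Denies tobacco; drinks socially."},
    {"role": "assistant",
     "content": "tobacco: no; alcohol: yes; drugs: not mentioned"},
]

def extract_substance_use(note: str, few_shot: bool = False) -> str:
    messages = [{"role": "system",
                 "content": "Extract tobacco, alcohol, and drug use status "
                            "from the clinical note."}]
    if few_shot:
        messages.extend(FEW_SHOT)
    messages.append({"role": "user", "content": f"Note: {note}"})
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0)
    return resp.choices[0].message.content

print(extract_substance_use("Smokes 1 ppd x 20 years; no EtOH.",
                            few_shot=True))
```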
The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant
NLP Tasks: Text Classification, Information Extraction, Question Answering
Method: Evaluation of ChatGPT, GPT-4, and LLaMA in identifying patients with specific diseases using gold-labeled Electronic Health Records (EHRs) from the MIMIC-III database.
Metrics:
- F1-score (≥85% for COPD, CKD, and PBC)
- F1-score (4.23% higher for PBC than traditional machine learning models)
- Precision
- Specificity
- Sensitivity
- Negative Predictive Value
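All of the metrics listed above can be derived from a binary confusion matrix; below is a short scikit-learn sketch on dummy labels and predictions (not the paper's MIMIC-III results).

```python
# Sketch: the listed metrics computed from a binary confusion matrix.
# y_true/y_pred are dummy values for illustration only.
from sklearn.metrics import confusion_matrix, f1_score, precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # gold label: disease present?
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # LLM prediction

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # a.k.a. recall
specificity = tn / (tn + fp)
npv = tn / (tn + fn)          # negative predictive value

print(f"F1={f1_score(y_true, y_pred):.2f} "
      f"precision={precision_score(y_true, y_pred):.2f} "
      f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"NPV={npv:.2f}")
```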
ARDSFlag: an NLP/machine learning algorithm to visualize and detect high-probability ARDS admissions independent of provider recognition and billing codes
NLP Tasks: Text Classification, Information Extraction
Method: the ARDSFlag algorithm, combining machine learning (ML) and natural language processing (NLP) techniques (a simplified component is sketched after the metrics)
Metrics:
- Accuracy (bilateral infiltrates: 91.9% ± 0.5%; heart failure/fluid overload in radiology reports: 86.1% ± 0.5%; echocardiogram notes: 98.4% ± 0.3%)
- Overall accuracy (89.0%)
- Specificity (91.7%)
- Recall (80.3%)
- Precision (75.0%)
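An illustrative sketch of one ARDSFlag-style component, flagging bilateral infiltrates in radiology report text with simple patterns and negation handling; the regexes are assumptions, and the published algorithm is considerably more elaborate.

```python
# Sketch: rule-based flag for bilateral infiltrates in a radiology report.
# The patterns and negation handling are simplified assumptions.
import re

INFILTRATE = re.compile(
    r"\b(bilateral|diffuse)\b.{0,40}\b(infiltrat|opacit|consolidat)", re.I)
NEGATION = re.compile(
    r"\b(no|without|clear of)\b.{0,30}\b(infiltrat|opacit)", re.I)

def flags_bilateral_infiltrates(report: str) -> bool:
    return bool(INFILTRATE.search(report)) and not NEGATION.search(report)

print(flags_bilateral_infiltrates(
    "CXR shows diffuse bilateral airspace opacities."))            # True
print(flags_bilateral_infiltrates("Lungs clear of infiltrates."))  # False
```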
Redefining Health Care Data Interoperability: Empirical Exploration of Large Language Models in Information Exchange
NLP Tasks: Information Extraction, Text Generation
Method: a text-based information-exchange approach facilitated by the LLM ChatGPT (see the sketch after the metrics)
Metrics:
- Accuracy (over 99%)
- Accuracy (NAME: 10.2%, NAME+SYN: 36.1% with typos, NAME+SYN: 61.8% with typo-specific fine-tuning)
- Accuracy (NAME: 11.2%, NAME+SYN: 92.7% for unseen synonyms)
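A minimal sketch of LLM-mediated exchange, asking a chat model to restructure free-text medication data into named fields; the target schema (NAME/DOSE/FREQUENCY) and the prompt are illustrative assumptions, not the paper's protocol.

```python
# Sketch: asking a chat model to restructure free-text medication data
# into named fields for exchange. Schema and prompt are assumptions.
from openai import OpenAI

client = OpenAI()

record = "pt on metformn 500mg BID and lisinopril 10 mg daily"  # typo kept
prompt = ("Rewrite the medication list below as JSON, one object per drug "
          "with keys NAME, DOSE, and FREQUENCY; correct obvious typos.\n"
          + record)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0)
print(resp.choices[0].message.content)  # downstream code would validate JSON
```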
Constructing synthetic datasets with generative artificial intelligence to train large language models to classify acute renal failure from clinical notes
NLP Tasks: Text Classification, Information Extraction, Question Answering
Method: a classifier that uses language models, trained on synthetic notes, to identify acute renal failure (see the sketch after the metrics)
Metrics:
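A minimal sketch of the synthetic-data idea, prompting a generative model for labeled synthetic notes that can then feed classifier training; the prompt wording and sampling settings are assumptions.

```python
# Sketch: generating labeled synthetic notes to train a classifier.
# The prompt wording and sampling settings are assumptions.
from openai import OpenAI

client = OpenAI()

def synth_note(has_arf: bool) -> str:
    condition = "with" if has_arf else "without"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Write a short, fully synthetic ICU progress "
                              f"note for a patient {condition} acute renal "
                              "failure."}],
        temperature=0.9)  # higher temperature for training-set diversity
    return resp.choices[0].message.content

# A tiny labeled set; a real pipeline would generate thousands of notes
# and feed them to a standard text-classification fine-tuning loop.
train_set = [(synth_note(flag), int(flag))
             for flag in (True, False) for _ in range(2)]
print(len(train_set), "synthetic training examples")
```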
Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer
NLP Tasks: Text Classification, Information Extraction, Question Answering
Method: data poisoning attacks against clinical LLMs, with breast cancer as the case study (an illustrative label-flipping sketch follows the metrics)
Metrics:
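An illustrative sketch of one common poisoning technique, label flipping on a text-classification training set; the flip rate and target class are assumptions, and the paper's attack construction is not reproduced here.

```python
# Sketch: label flipping, one common data-poisoning technique. The flip
# rate and target class are assumptions, not the paper's attack.
import random

def poison_labels(dataset, target_label=1, flip_rate=0.05, seed=0):
    """Flip a small fraction of target-class labels in the training set."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in dataset:
        if label == target_label and rng.random() < flip_rate:
            label = 1 - label  # the model will now learn from a bad label
        poisoned.append((text, label))
    return poisoned

clean = [("mass is malignant", 1), ("benign findings", 0)] * 50
print("positives before:", sum(lbl for _, lbl in clean),
      "after:", sum(lbl for _, lbl in poison_labels(clean)))
```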
A Large Language Model Screening Tool to Target Patients for Best Practice Alerts: Development and Validation
NLP Tasks: Text Classification
Method: an AI screening tool using the BioMed-RoBERTa model (see the sketch after the metrics)
Metrics:
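A minimal sketch of a screening classifier built on BioMed-RoBERTa; the public allenai/biomed_roberta_base checkpoint stands in for the paper's fine-tuned model, and the alert threshold is an assumption.

```python
# Sketch: screening notes with BioMed-RoBERTa to decide whether to fire a
# best-practice alert. The base checkpoint stands in for the paper's
# fine-tuned model (its classification head here is randomly initialized).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/biomed_roberta_base")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/biomed_roberta_base", num_labels=2)
model.eval()

def should_alert(note: str, threshold: float = 0.5) -> bool:
    inputs = tokenizer(note, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return probs[0, 1].item() >= threshold  # fire the alert?

print(should_alert("History of chronic hepatitis B, not on treatment."))
```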
Evaluation and mitigation of the limitations of large language models in clinical decision-making
NLP Tasks: Information Extraction, Text Classification, Question Answering
Method: a framework that simulates a realistic clinical setting using a curated dataset derived from the Medical Information Mart for Intensive Care (MIMIC) database (see the sketch after the metrics)
Metrics:
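A minimal sketch of the simulated-clinic loop, revealing curated record items stepwise and letting the model either commit to a diagnosis or request more data; the record contents, prompt, and stopping rule are illustrative assumptions.

```python
# Sketch: a simulated clinical setting where record items are revealed
# stepwise and the model either diagnoses or asks for more data.
from openai import OpenAI

client = OpenAI()
record = {"history": "RLQ abdominal pain for 12 hours, nausea.",
          "labs": "WBC 14.2, CRP elevated.",
          "imaging": "US: noncompressible, dilated appendix."}

context, answer = [], "MORE"
for key in ("history", "labs", "imaging"):  # reveal one item per turn
    context.append(f"{key}: {record[key]}")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", temperature=0,
        messages=[{"role": "user",
                   "content": "\n".join(context) + "\nGive a working "
                              "diagnosis, or reply only MORE if you need "
                              "more information."}])
    answer = resp.choices[0].message.content.strip()
    if answer.upper() != "MORE":
        break
print(answer)
```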
Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models
NLP Tasks: Information Extraction, Text Classification, Question Answering, Text Generation
Method: evaluation of three popular large language models (LLMs): Bard, ChatGPT-3.5, and GPT-4, using various prompt strategies and a majority-voting strategy (sketched after the metrics)
Metrics:
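A minimal sketch of the majority-voting strategy, pooling candidate diagnoses and keeping the most frequent answer; the votes shown are dummy strings, not outputs from Bard, ChatGPT-3.5, or GPT-4.

```python
# Sketch: majority voting over candidate diagnoses. The votes are dummy
# strings, not model outputs.
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

votes = ["Fabry disease", "fabry disease", "Gaucher disease"]
print(majority_vote(votes))  # -> "fabry disease"
```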
The shaky foundations of large language models and foundation models for electronic health records
NLP Tasks: Information Extraction, Text Classification, Question Answering
Method: a narrative review and a taxonomy of foundation models trained on non-imaging EMR data
Metrics:
- Not applicable (narrative review)